Range Indexing

EMu 3.1 saw the addition of several indexing methods to the database engine. In particular support for NULL indexing (whether a field is empty or not) and PARTIAL indexing (fast searching for leading characters, e.g. a*) was added. Tools were provided that allowed System Administrators to configure, via the EMu Registry, which fields required the new indexing methods. The new indexing facilities did not provide a mechanism for adjusting or tuning range indexes.

EMu 4.0.01 saw the introduction of additional tools that permit System Administrators to tune the range indexing used by EMu. Support for automatic optimization of range indexes has also been added. Using these tools EMu can now provide optimal range indexes with significantly faster range based searches.

Note: Included with the following description of EMu Range Indexing, you will find details about Range Indexing Registry entries. If you are specifically looking for details of Range Indexing Registry entries, see Range Buckets Registry entry.

The range indexing employed by the EMu database server is designed to work with the Two Level Superimposed Coding Scheme for Partial Match Retrieval system used for all searching. As it is an extension of the standard indexing EMu still requires only one set of index files, thus avoiding the need for costly "join" based queries.

Range indexing is really a series of "mini" indexes on a per field basis. Unlike the Two Level scheme, where all search terms are placed in the one index, range terms are placed in per field indexes that are then concatenated to form one range index. Each field index consists of a number of range buckets. These buckets are used to indicate whether a given value falls within the bucket or not.

There are two related considerations when establishing range buckets:

  1. Data distribution: as best as possible data should be distributed evenly between the buckets.
  2. Logical query ranges. The aim is to minimize the necessity to check a bucket for a value as checking whether records in a bucket match the query takes time; therefore if users are likely to search on particular ranges (decades for instance: 1/1/1910 to 1/1/1920) it makes sense to configure range buckets appropriately.

An example may make things clearer. Consider Width: (Measurement Details) in the Catalog module. Let's say that the width of photographs is generally from 10cm to 50cm so we establish the following range buckets:

Index

Note:

  1. -infinity and +infinity will capture any values outside the specified ranges.

When a range search is performed these buckets are used to determine possible matches. Consider the following searches for photographs with a width of between:

  1. 15cm to 35cm:

    Index

    In this case:

    1. Three buckets can be safely ignored.
    2. One bucket does not need to be checked and is definitely included in the search.
    3. Two buckets need to be checked for a match.
  2. 15cm to 30cm:

    Index

    In this case:

    1. Four buckets can be safely ignored.
    2. One bucket does not need to be checked and is definitely included in the search.
    3. One bucket needs to be checked.
  3. 10cm to 35cm:

    Index

    In this case:

    1. Three buckets can be safely ignored.
    2. Two buckets do not need to be checked and are definitely included in the search.
    3. One bucket needs to be checked.
  4. 10cm to 30cm:

    Index

    In this case:

    1. Four buckets can be safely ignored.
    2. Two bucket do not need to be checked and are included in the search.
    3. No buckets need to be checked.

    In this case, no buckets need to be checked!

It should be clear from the last example that determining range buckets that match likely user searches has a direct influence on the speed of range based queries.

The problem with setting the range bucket values is that the bucket values depend on two variables. The first is the range of values entered into a field, in other words the data distribution. If the Width field contains a wide range of values, it makes sense to have a wide range of buckets. If, however, the data is centered around one point, it may make sense to have a series of range buckets cover this period.

The second variable is the query ranges used to retrieve data. The problem with this variable is that for a given field it may be very hard to determine what sort of query ranges will be used without performing extensive analysis.

As these two variables cannot be known before data has been loaded EMu provides a default set of range buckets for each range searchable field. In general these buckets are satisfactory for a small numbers of records, however significant reductions in search times may be achieved by "tuning" the range buckets for large numbers of records.

Tuning the range indexes involves considering the two variables, data distribution and query ranges, and for each variable determining the set of range buckets that provides optimal performance. The objective in fine-tuning range indexing is to:

  1. Minimize the number of buckets that need to be searched.
  2. Minimize the number of records in buckets that need to be searched. Note that this objective will take precedence over the default distribution of records evenly between buckets.

EMu 4.0.01 provides a mechanism that allows System Administrators to have the range buckets tuned automatically based on the data distribution within a field. The facility does not take query ranges into account as this information requires subjective interpretation to determine which queries are important to have optimized and which queries can run a little slower.

The new facility does however provide data distribution information that may be used by a System Administrator to set the range buckets manually to achieve optimal performance where query ranges are known and can be weighted.

Hence the new facility provides automated range buckets, but allows System Administrators to override these settings and configure their own buckets manually.